Web Crawling as an AI Project
Author
Abstract
This paper argues for the introduction of real-world programming projects into AI curricula, specifically using Python as an implementation language. We describe a modular set of projects centered around a focused web crawler, along with potential extensions. The author’s experiences using this project in a class of undergraduates and Master’s students are also discussed.
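As an illustration of what the core of such a project might look like, the following is a minimal sketch of a focused crawler's frontier, not the author's implementation. The names `relevance`, `focused_crawl`, and `TOY_WEB` are hypothetical, the keyword-overlap score is an assumed stand-in for whatever relevance measure a course would use, and an in-memory dictionary replaces real HTTP fetching so the example runs offline.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Page = Tuple[str, List[str]]  # (page text, outgoing links)


def relevance(text: str, topic_terms: Iterable[str]) -> float:
    """Toy score (an assumption): fraction of topic terms present in the text."""
    words = set(text.lower().split())
    terms = [t.lower() for t in topic_terms]
    if not terms:
        return 0.0
    return sum(t in words for t in terms) / len(terms)


def focused_crawl(
    seeds: List[str],
    fetch: Callable[[str], Optional[Page]],
    topic_terms: Iterable[str],
    max_pages: int = 10,
    threshold: float = 0.5,
) -> List[str]:
    """Visit pages best-first; a link inherits the score of the page it was found on."""
    terms = list(topic_terms)
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant: List[str] = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        fetched += 1
        if page is None:  # unreachable or unknown URL
            continue
        text, links = page
        score = relevance(text, terms)
        if score >= threshold:
            relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant


# Hypothetical in-memory "web" standing in for real HTTP requests.
TOY_WEB: Dict[str, Page] = {
    "a": ("python web crawler tutorial", ["b", "c"]),
    "b": ("cooking recipes for pasta", ["d"]),
    "c": ("focused crawler in python", ["e"]),
    "d": ("more pasta recipes", []),
    "e": ("crawler frontier queue python", []),
}

if __name__ == "__main__":
    print(focused_crawl(["a"], TOY_WEB.get, ["python", "crawler"]))
```

In a classroom setting, `fetch` could be replaced by a function that downloads and parses real pages, and `relevance` by a learned classifier, without changing the frontier logic.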
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task, and an unfocused approach often produces undesired results. Several new ideas have therefore been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...
ARCOMEM Crawling Architecture
The World Wide Web is the largest information repository available today. However, this information is very volatile and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limita...
RIDIRE-CPI: an Open Source Crawling and Processing Infrastructure for Web Corpora Building
This paper introduces the RIDIRE-CPI, an open source tool for the building of web corpora with a specific design through a targeted crawling strategy. The tool has been developed within the RIDIRE Project, which aims at creating a 2 billion word balanced web corpus for Italian. RIDIRE-CPI architecture integrates existing open source tools as well as modules developed specifically within the RID...
Collaborative Web Crawler over High-speed Research Network
This paper proposes an idea for constructing a distributed web crawler by utilizing existing high-speed research networks. This is an initial effort of the Web Language Engineering (WLE) project which investigates techniques in processing the languages found in published web documents. In this paper, we focus on designing a geographically distributed web crawler. Multiple crawlers work collabor...
Augmenting Focused Crawling Using Search Engine Queries
The pervasiveness of the Internet makes it an ideal medium for sharing scholarly information. Nowadays, many authors post their publications online so that others may easily access them, increasing the author’s impact in his/her research area. In this project, we develop a focused crawler to find publication pages, that is, web pages that link to online, freely available scholarly publications. In c...